class: center, middle, inverse, title-slide

# Support Vector Machines
## 🚧 🛣️ 🦿
### Applied Machine Learning in R
Pittsburgh Summer Methodology Series
### Lecture 4-A July 22, 2021

---
class: inverse, center, middle
# Overview

<style type="text/css">
.onecol {
    font-size: 26px;
}
.twocol {
    font-size: 24px;
}
.remark-code {
    font-size: 24px;
    border: 1px solid grey;
}
a {
    background-color: lightblue;
}
.remark-inline-code {
    background-color: white;
}
</style>

---
class: onecol
## Roadmap

1. Maximal Margin Classifier 🚶
1. Support Vector Classifier 🏃
1. Support Vector Machine 🚴
1. Applied Example
1. Support Vector Regression
1. Live Coding
1. Hands-on Activity

---
class: onecol
## Notices and Disclaimers

The ideas underlying SVM are really *clever* and *interesting*! 😃

--

SVM is also a good algorithm for *smaller*, *messy* datasets!! 😍

--

However, there is a lot of *terminology* and *math* involved... 😱

--

<p style="padding-top:15px;">I will try to shield you from this and give only <b>the necessities</b></p>

- That means there will be some things I need to "hand wave"
- I may also need to skip questions with very technical answers

--

<p style="padding-top:15px;">But you should get a <b>strong intuition</b> and <b>applied knowledge</b></p>

- This will prepare you nicely to dive into a longer course on the topic

---
class: inverse, center, middle
# SVM Intuitions

---
class: onecol
## A Tale of Two Classes

If this is our training data, how do we **predict the class** of new data?
<img src="data:image/png;base64,#maxmargin1.png" width="100%" />

---
class: onecol
## Drawing a Line in the Sand

With one feature, we could find a **point** that separates the classes (as higher or lower)

<img src="data:image/png;base64,#maxmargin2.png" width="100%" />

---
class: onecol
## Analysis Paralysis

But there are many possible decision points, so **which should we use?**

<img src="data:image/png;base64,#maxmargin3.png" width="100%" />

---
class: onecol
## Maximal Margin Classifier (MMC)

The MMC algorithm finds and uses the point with the **largest** (i.e., maximal) **margin**

<img src="data:image/png;base64,#maxmargin4.png" width="100%" />

---
class: onecol
## Maximal Margin Classifier

If we have two features, we can extend this idea using a 2D plot and a decision **line**

<img src="data:image/png;base64,#maxmargin5.png" width="80%" />

---
class: onecol
## Maximal Margin Classifier

If we have three features, we will need a 3D plot and a decision **plane** (i.e., flat surface)

<img src="data:image/png;base64,#3d_plane.gif" width="45%" />

.footnote[[1] Credit to [Zahra Elhamraoui](https://medium.datadriveninvestor.com/support-vector-machine-svm-algorithm-in-a-fun-easy-way-fc23a008c22) for this visualization.]

---
class: onecol
## Maximal Margin Classifier

If we have four or more features, we will need a decision **hyperplane**

--

.bg-light-yellow.b--light-red.ba.bw1.br3.pl4[
**Caution:** You may hurt yourself if you try to imagine what a hyperplane looks like.

.tr.pr4[
But here is a hint: 🍫✈️
]
]

--

.pt1[
**Margins still exist** in higher-dimensional space and we still want to maximize them
]

- Our goal is thus to locate the class-separating hyperplane with the largest margin
- The math behind this is beyond the scope of our workshop, but that's the idea

--

.pt1[
We can still **classify new observations**: which side of the hyperplane do they fall on?
]

---
class: onecol
## Maximal Margin Classifier

Only the observations that define the margin, called **support vectors**, are used

Because it only uses a subset of data anyway, MMC does well with smaller datasets

<img src="data:image/png;base64,#maxmargin8.png" width="80%" />

---
class: onecol
## Maximal Margin Classifier

This means that **outliers can have an outsized impact** on what is learned

For instance, this margin is likely to misclassify examples in new data

<img src="data:image/png;base64,#maxmargin9.png" width="80%" />

---
class: onecol
## Support Vector Classifier (SVC)

The SVC algorithm is like MMC but it **allows examples to be misclassified** (i.e., wrong)

This will **increase bias** (training errors) but hopefully **decrease variance** (testing errors)

<img src="data:image/png;base64,#svc1.png" width="80%" />

---
class: onecol
## Support Vector Classifier

SVCs also enable a model to be trained when the classes are **not perfectly separable**

A straight line is never going to separate these classes without errors (sorry MMC...)

<img src="data:image/png;base64,#svc2.png" width="80%" />

---
class: onecol
## Support Vector Classifier

But if we allow a few errors and points within the margin...

...we may be able to find a hyperplane that generalizes pretty well

<img src="data:image/png;base64,#svc3.png" width="80%" />

---
class: onecol
## Support Vector Classifier

When points are on the wrong side of the margin, they are called "violations"

A **softer margin** allows more violations, whereas a **harder margin** allows fewer

SVCs have a hyperparameter `\(C\)` that controls how soft vs.
hard the margin is

--

- A **lower `\(C\)` value** makes the margin harder (allows fewer violations)<sup>1</sup>
  As a result, the model has **lower bias** and more flexibility but may overfit
- A **higher `\(C\)` value** makes the margin softer (allows more violations)
  As a result, the model has less flexibility but may also have **lower variance**

.footnote[[1] If you set `\\(C=0\\)` (i.e., a fully hard margin) SVC will allow no violations and behave the same as MMC.]

---
class: onecol
## Support Vector Machine

So far, MMC and SVCs have both used linear (i.e., flat) hyperplanes

But there are many times when the classes are not **linearly separable**

<img src="data:image/png;base64,#svm1.png" width="80%" />

.footnote[[1] Good luck separating these classes with a single decision point...]

---
class: onecol
## Support Vector Machine

But if we enlarge the feature space, the classes might then become linearly separable

There are many ways to do this enlarging, but one is to add polynomial expansions

<img src="data:image/png;base64,#svm2.png" width="80%" />

---
class: onecol
## Support Vector Machine

The classes are now linearly separable in this new enlarged feature space!

<img src="data:image/png;base64,#svm3.png" width="80%" />

---
class: onecol
## Support Vector Machine

Here is a more complex example of a nonlinear (and non-polynomial) expansion

<img src="data:image/png;base64,#svm4.png" width="75%" />

.footnote[[1] Credit to [Eric Kim](https://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html) for this example and visualization.]

---
class: onecol
## Support Vector Machine

And here is the hyperplane (linear in 3D but nonlinear when "projected" back in 2D)

<img src="data:image/png;base64,#svm5.png" width="75%" />

.footnote[[1] Credit to [Eric Kim](https://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html) for this example and visualization.]
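---
class: onecol
## Support Vector Machine

Here is a minimal base R sketch of the feature-space enlargement idea from the previous slides. The toy data are made up, and the helper `separable_by_cutpoint()` and all variable names are hypothetical (for illustration only).

```r
# Toy 1D data (made up): class "B" sits between two groups of class "A"
x <- c(-3, -2.5, -2, -0.5, 0, 0.5, 2, 2.5, 3)
y <- c("A", "A", "A", "B", "B", "B", "A", "A", "A")

# Hypothetical helper: does one class lie entirely below the other on this feature?
separable_by_cutpoint <- function(feature, class) {
  a <- range(feature[class == "A"])  # min and max of class A
  b <- range(feature[class == "B"])  # min and max of class B
  a[2] < b[1] || b[2] < a[1]
}

separable_by_cutpoint(x, y)    # FALSE: no single decision point works on x
separable_by_cutpoint(x^2, y)  # TRUE: a single cut on x^2 separates the classes

# Place the cut midway between the closest points of each class on the x^2 axis
cut_point <- (max(x[y == "B"]^2) + min(x[y == "A"]^2)) / 2

# Classify new observations by which side of the cut they fall on
new_x <- c(-1, 1.8)
pred <- ifelse(new_x^2 > cut_point, "A", "B")
pred
```

Back in the original 1D space, this single cut on `x^2` corresponds to two decision points (at plus and minus `sqrt(cut_point)`), which is why the boundary looks nonlinear when "projected" back.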
---
class: onecol
## Support Vector Machine

The **Support Vector Machine** (SVM) allows us to efficiently enlarge the feature space

- Part of what makes SVMs efficient is they **only consider the support vectors**
- They also use **kernel functions** to quantify the similarity of pairs of support vectors<sup>1</sup>

.footnote[[1] These similarity estimates are used to efficiently find the optimal hyperplane but that process is complex.]

--

<p style="padding-top:15px;">The SVC can actually be considered a simple version of the SVM with a <b>linear kernel</b></p>

- A linear kernel quantifies similarity using the inner (dot) product, which is closely related to the Pearson correlation

`$$k(x, x') = \langle x, x'\rangle$$`

--

<p style="padding-top:15px;">Linear kernels are efficient but <b>nonlinear kernels</b> may provide better performance</p>

---
exclude: true
class: onecol
## Support Vector Machine

It is common to also use **nonlinear** kernels, such as the **polynomial kernel**

`$$k(x, x')=(\text{scale} \cdot \langle x, x' \rangle + \text{offset})^\text{degree}$$`

With larger values of `\(\text{degree}\)`, the decision boundary can become more complex

- You are essentially adding polynomial expansions of `\(\text{degree}\)` to each predictor
- You have expanded the feature space and may now have linear separation
- This is the same idea we just used in fitting a hyperplane in the `\(x\text{-by-}x^2\)` space!

If you center or normalize all your predictors, you can drop the `\(\text{offset}\)` term

.footnote[[1] When `\\(\text{degree}=1\\)`, the polynomial kernel reduces to the linear kernel and SVM becomes SVC again.]
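---
class: onecol
## Support Vector Machine

The linear and polynomial kernel formulas above can be sketched in a few lines of base R. The function names, default values, and example vectors below are assumptions made for illustration. They are not taken from any package.

```r
# Linear kernel: the inner (dot) product of two observations
linear_kernel <- function(x1, x2) sum(x1 * x2)

# Polynomial kernel: (scale * <x, x'> + offset)^degree
poly_kernel <- function(x1, x2, degree = 2, scale = 1, offset = 1) {
  (scale * sum(x1 * x2) + offset)^degree
}

a <- c(1, 2)
b <- c(3, 1)

k_lin  <- linear_kernel(a, b)            # 1*3 + 2*1 = 5
k_poly <- poly_kernel(a, b, degree = 2)  # (5 + 1)^2 = 36

# With degree = 1 and offset = 0, the polynomial kernel is the linear kernel
k_deg1 <- poly_kernel(a, b, degree = 1, offset = 0)

# Explicit degree-2 expansion of a 2D point (the enlarged feature space)
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

# The kernel gives the same similarity without ever building the expansion
k_explicit <- sum(phi(a) * phi(b))
k_implicit <- poly_kernel(a, b, degree = 2, offset = 0)
all.equal(k_explicit, k_implicit)
```

The last comparison is the point of using kernels: the similarity in the enlarged space is computed directly from the original features, so the expansion never has to be stored.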
---
class: onecol
## Support Vector Machine

Perhaps the most common nonlinear kernel is the **radial basis function** (RBF)

`$$k(x, x') = \exp\left(-\sigma \|x-x'\|^2\right)$$`

--

The intuition here is that similarity is weighted by how *close* the observations are

- Only support vectors near new observations influence classification strongly
- As the `\(\sigma\)` hyperparameter<sup>1</sup> increases, the fit becomes more *local* and complex

The RBF kernel computes similarity between points in *infinite* dimensions 🤯

- It is considered an ideal "general purpose" nonlinear kernel<sup>2</sup>

.footnote[
[1] Note that `\(\sigma\)` is also sometimes called `\(\gamma\)` or the "scale" hyperparameter.<br />
[2] Other nonlinear kernels are popular for special purposes and in specialized subfields.
]

---
## Support Vector Regression

---
## Applied Example

---
## Live Coding Activity

---
## Hands-on Activity

---
## Break and timer
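
---
class: onecol
## RBF Kernel Sketch

Returning to the RBF kernel formula from earlier, here is a minimal base R sketch of how similarity decays with distance and with `\(\sigma\)`. The function name `rbf_kernel()` and the example points are assumptions made for illustration.

```r
# RBF kernel: exp(-sigma * squared Euclidean distance)
rbf_kernel <- function(x1, x2, sigma = 1) {
  exp(-sigma * sum((x1 - x2)^2))
}

origin <- c(0, 0)
near   <- c(0.5, 0)  # squared distance from origin: 0.25
far    <- c(3, 0)    # squared distance from origin: 9

k_self <- rbf_kernel(origin, origin)  # identical points: exp(0) = 1
k_near <- rbf_kernel(origin, near)    # exp(-0.25): still quite similar
k_far  <- rbf_kernel(origin, far)     # exp(-9): almost no similarity

# A larger sigma makes similarity fall off faster (a more "local" fit)
k_near_big_sigma <- rbf_kernel(origin, near, sigma = 10)  # exp(-2.5)

c(k_self, k_near, k_far, k_near_big_sigma)
```

With a large `\(\sigma\)`, even nearby points count as dissimilar, so each support vector influences only a small neighborhood.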